Take Home Exercise 03: Prototyping Modules for Geospatial Analytics Shiny Application
1. Overview
In this take-home exercise, I will focus on prototyping Geographically Weighted Regression (GWR) models for my group’s Shiny App. GWR is a spatial statistical method that accounts for spatially non-stationary variables (such as climate, demographics, and physical environment characteristics) and models the local relationships between these independent variables and the outcome of interest. In this case, the dependent variable is the monthly rental price of HDB flats in Singapore, and I will examine how factors such as proximity to MRT stations and the CBD and the number of nearby amenities influence HDB rental prices.
The data preparation and Exploratory Data Analysis were handled by my groupmate, so for this exercise I will load the data directly from an RDS file.
2. The R Packages
pacman::p_load(olsrr, corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, gtsummary)
olsrr: OLS model building and diagnostic tests
corrplot: correlation matrix visualisation
ggpubr: publication-ready plots
sf: geospatial data handling
spdep: spatial dependence analysis
GWmodel: calibrating geographically weighted models
tmap: thematic mapping
tidyverse: attribute data handling
gtsummary: publication-quality summary tables
3. The Data
3.1 Aspatial Data
First, import the rental dataset; the data wrangling was done by my teammate (please refer to here for details).
rental.sf contains the rental data from Jan 2020 to Sept 2024, as well as the following fields:
Dependent:
- monthly_rent: monthly rental fee
Continuous:
- Proximity measures: kindergarten, childcare, hawker, bus stop, shopping mall, MRT, primary school, CBD
- Count of amenities within a specific distance: kindergarten, childcare, hawker, bus stop, shopping mall
Categorical:
- flat_type: flat type
- town: town
- region: region
rental.sf <- read_rds("data/rds/rental_sf.rds")
head(rental.sf)
Simple feature collection with 6 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 15786.61 ymin: 30769.77 xmax: 39668.39 ymax: 45634.94
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 19
rent_approval_date town flat_type monthly_rent geometry
<date> <chr> <chr> <dbl> <POINT [m]>
1 2024-01-01 YISHUN 4-ROOM 3700 (29463.95 45634.94)
2 2024-01-01 JURONG WE… 4-ROOM 3900 (15786.61 34389.18)
3 2024-01-01 SENGKANG 5-ROOM 3700 (33244.8 42043.31)
4 2024-01-01 KALLANG/W… 3-ROOM 3600 (30102.41 32911.75)
5 2024-01-01 BEDOK 5-ROOM 4350 (39668.39 33918.01)
6 2024-01-01 QUEENSTOWN 4-ROOM 3000 (25265.85 30769.77)
# ℹ 14 more variables: region <chr>, no_of_kindergarten_500m <dbl>,
# prox_kindergarten <dbl>, no_of_childcare_500m <dbl>, prox_childcare <dbl>,
# no_of_hawker_500m <dbl>, prox_hawker <dbl>, no_of_busstop_500m <dbl>,
# prox_busstop <dbl>, no_of_shoppingmall_1km <dbl>, prox_shoppingmall <dbl>,
# prox_mrt <dbl>, prox_prisch <dbl>, prox_cbd <dbl>
st_crs(rental.sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Notice that rental.sf is in EPSG 3414.
3.2 Geospatial Data
We use st_read() of the sf package to import the MP14_SUBZONE_WEB_PL shapefile.
mpsz = st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`/Users/mingwei/Desktop/SMU/Y3S1/IS415/xXxPMWxXx/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
As the aspatial data we are going to use is in EPSG 3414, the code chunk below transforms the mpsz object to EPSG 3414 using st_transform() of the sf package.
mpsz_svy21 <- st_transform(mpsz, 3414)
st_crs(mpsz_svy21)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Notice that the EPSG code is now 3414, the same as rental.sf. We are good to go.
4. Hedonic Pricing Modelling
4.1 Simple Linear Regression Method
To start off, let’s build a simple linear regression model using monthly_rent as the dependent variable and prox_mrt as the independent variable.
I will try out simple linear regression but won’t dive deeply into it, as this method will not be part of our Shiny App. Instead, I will focus on the multiple linear regression method and on building hedonic pricing models with the GWmodel package in the next section, where I will explore the different arguments available so that we can include them in our Shiny App.
hdb.slr <- lm(formula = monthly_rent ~ prox_mrt, data = rental.sf)
lm() returns an object of class “lm” (or, for multiple responses, of class c(“mlm”, “lm”)).
The functions summary() and anova() can be used to obtain and print a summary and an analysis-of-variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm().
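As a quick, self-contained illustration of these accessors, here is a sketch on toy synthetic data (the variable names below are made up; this is not the HDB dataset):

```r
# Toy illustration of lm() accessor functions on synthetic data
set.seed(42)
dist_m  <- runif(200, 0, 2000)                        # hypothetical distance predictor
rent    <- 3200 - 0.2 * dist_m + rnorm(200, sd = 50)  # known true relationship plus noise
toy.slr <- lm(rent ~ dist_m)

coef(toy.slr)          # intercept and slope, close to 3200 and -0.2
head(fitted(toy.slr))  # predicted values for each observation
head(residuals(toy.slr))  # observed minus fitted
```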
summary(hdb.slr)
The output report reveals that monthly_rent can be explained by the formula:
monthly_rent = 3200.84161 - 0.20003 * prox_mrt
In the coefficients section, the p-values test the hypothesis that each coefficient is equal to zero. Since both values are less than 0.001, the intercept and prox_mrt are both statistically significant.
The multiple R-squared of 0.01045 indicates that only about 1.04% of the variability in rental prices is explained by the model. This suggests that other factors likely influence rental prices, as nearly 99% of the variability remains unexplained.
Since the p-value is much smaller than 0.0001, we reject the null hypothesis that the mean is a good estimator of monthly_rent. This allows us to infer that the simple linear regression model above is a better estimator of monthly_rent than the mean.
To visualise the best fit curve on a scatterplot, we can incorporate lm() as a method function in ggplot’s geometry as shown in the code chunk below.
ggplot(data=rental.sf,
aes(x=`prox_mrt`, y=`monthly_rent`)) +
geom_point() +
geom_smooth(method = lm)
We can see from the plot above that prox_mrt alone is not a very accurate predictor in a simple linear regression model.
4.2 Multiple Linear Regression
4.2.1 Visualising the Relationships of the Independent Variables
Before constructing a multiple regression model, it’s crucial to verify that the independent variables are not highly correlated with one another. Using highly correlated independent variables by mistake can undermine the model’s quality. This issue is referred to as multicollinearity in statistics.
A correlation matrix is often utilized to visualize the relationships among independent variables. In addition to R’s pairs() function, there are several packages available that facilitate the display of a correlation matrix. In this section, we will use the corrplot package.
First, let’s check the columns of rental.sf.
names(rental.sf)
 [1] "rent_approval_date"      "town"
[3] "flat_type" "monthly_rent"
[5] "geometry" "region"
[7] "no_of_kindergarten_500m" "prox_kindergarten"
[9] "no_of_childcare_500m" "prox_childcare"
[11] "no_of_hawker_500m" "prox_hawker"
[13] "no_of_busstop_500m" "prox_busstop"
[15] "no_of_shoppingmall_1km" "prox_shoppingmall"
[17] "prox_mrt" "prox_prisch"
[19] "prox_cbd"
Next, let’s pick all the numeric independent variables (columns 7 to 19).
independent_columns <- rental.sf %>%
select(7:19) %>%
  st_drop_geometry()
The code chunk below plots correlation matrices of the relationships between the independent variables in the rental.sf data frame.
corrplot(cor(independent_columns), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
corrplot(cor(independent_columns), diag = FALSE, order = "FPC",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
corrplot(cor(independent_columns), diag = FALSE, order = "AOE",
tl.pos = "ld", tl.cex = 0.5, method = "number", type = "lower")
corrplot(cor(independent_columns), diag = FALSE, order = "AOE",
tl.pos = "full", tl.cex = 0.5, method = "number", type = "full")
corrplot(cor(independent_columns), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "circle", type = "upper")
After reviewing the documentation for the corrplot package and examining the plot above, I believe it would be beneficial to create a page where users can interact with various arguments to explore the correlation matrix of the independent variables.
The arguments accept different options, such as:
method:
- circle (default)
- square
- ellipse
- number
- pie
- shade
- color
type:
- full (default) => tl.pos needs to be set to “full”; it cannot be “td” as with type = “upper”
- upper => tl.pos = “td”
- lower => tl.pos = “ld”
order:
- original (default) => original order
- AOE => angular order of the eigenvectors
- FPC => first principal component order
- hclust => hierarchical clustering order; hclust.method can then be set to one of ‘ward’, ‘ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’ or ‘centroid’
- alphabet => alphabetical order
Below is the UI design for our Shiny App, allowing users to interactively explore the relationships among the independent variables.
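A minimal sketch of how such a control page could look, assuming the numeric independent variables are held in a data frame named indep_df (all input IDs and object names here are hypothetical, not our final app code):

```r
library(shiny)
library(corrplot)

# Hypothetical option vectors mirroring the corrplot arguments listed above
method_choices <- c("circle", "square", "ellipse", "number", "pie", "shade", "color")
type_choices   <- c("full", "upper", "lower")
order_choices  <- c("original", "AOE", "FPC", "hclust", "alphabet")

ui <- fluidPage(
  selectInput("corr_method", "Method", choices = method_choices),
  selectInput("corr_type",   "Type",   choices = type_choices),
  selectInput("corr_order",  "Order",  choices = order_choices),
  plotOutput("corr_plot")
)

server <- function(input, output, session) {
  output$corr_plot <- renderPlot({
    # tl.pos must match the chosen type (see the notes above)
    tl <- switch(input$corr_type, full = "full", upper = "td", lower = "ld")
    corrplot(cor(indep_df),
             method = input$corr_method,
             type   = input$corr_type,
             order  = input$corr_order,
             tl.pos = tl, tl.cex = 0.5, diag = FALSE)
  })
}

# shinyApp(ui, server)  # launch; commented out in this prototype
```

Coupling tl.pos to type inside the server avoids the invalid combinations noted above.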
4.2.2 Building the Hedonic Pricing Model using Multiple Linear Regression Method
names(independent_columns)
 [1] "no_of_kindergarten_500m" "prox_kindergarten"
 [3] "no_of_childcare_500m"    "prox_childcare"
 [5] "no_of_hawker_500m"       "prox_hawker"
 [7] "no_of_busstop_500m"      "prox_busstop"
 [9] "no_of_shoppingmall_1km"  "prox_shoppingmall"
[11] "prox_mrt"                "prox_prisch"
[13] "prox_cbd"
rental_mlr <- lm(formula = monthly_rent ~ no_of_kindergarten_500m + prox_kindergarten +
no_of_childcare_500m + prox_childcare + no_of_hawker_500m
+ prox_hawker + no_of_busstop_500m + prox_busstop
+ no_of_shoppingmall_1km + prox_shoppingmall + prox_mrt
+ prox_prisch + prox_cbd,
data = rental.sf)
summary(rental_mlr)
Call:
lm(formula = monthly_rent ~ no_of_kindergarten_500m + prox_kindergarten +
no_of_childcare_500m + prox_childcare + no_of_hawker_500m +
prox_hawker + no_of_busstop_500m + prox_busstop + no_of_shoppingmall_1km +
prox_shoppingmall + prox_mrt + prox_prisch + prox_cbd, data = rental.sf)
Residuals:
Min 1Q Median 3Q Max
-2621.8 -377.5 15.4 415.4 3361.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.432e+03 3.317e+01 103.469 < 2e-16 ***
no_of_kindergarten_500m 7.354e+00 3.579e+00 2.055 0.0399 *
prox_kindergarten -4.321e-02 2.791e-02 -1.548 0.1216
no_of_childcare_500m 4.975e-01 1.449e+00 0.343 0.7314
prox_childcare -4.983e-02 4.756e-02 -1.048 0.2948
no_of_hawker_500m -4.873e+01 6.515e+00 -7.479 7.72e-14 ***
prox_hawker 4.780e-02 1.078e-02 4.432 9.36e-06 ***
no_of_busstop_500m 7.486e+00 1.057e+00 7.085 1.42e-12 ***
prox_busstop 1.836e-01 7.566e-02 2.426 0.0153 *
no_of_shoppingmall_1km -6.699e+00 3.549e+00 -1.887 0.0591 .
prox_shoppingmall -1.216e-01 1.415e-02 -8.598 < 2e-16 ***
prox_mrt -1.248e-01 1.327e-02 -9.411 < 2e-16 ***
prox_prisch 8.372e-04 1.588e-02 0.053 0.9580
prox_cbd -2.798e-02 1.118e-03 -25.034 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 646.5 on 25699 degrees of freedom
Multiple R-squared: 0.03757, Adjusted R-squared: 0.03708
F-statistic: 77.17 on 13 and 25699 DF, p-value: < 2.2e-16
Based on the coefficients section, we can see that not all the independent variables are statistically significant, so some can be removed from our model. Based on the Pr(>|t|) (p-value) column, we can remove the prox_kindergarten, no_of_childcare_500m, prox_childcare, no_of_shoppingmall_1km and prox_prisch fields.
Notice that the adjusted R-squared value is only 0.03708.
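Rather than dropping variables by hand, the app could also expose automatic backward elimination by AIC via base R’s step(). A sketch on synthetic data (hypothetical variables, not the rental model):

```r
# Backward elimination by AIC: y truly depends on x1 only,
# so step() should tend to drop the pure-noise predictors x2 and x3
set.seed(123)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)

full  <- lm(y ~ x1 + x2 + x3)
final <- step(full, direction = "backward", trace = 0)
names(coef(final))  # typically just "(Intercept)" and "x1"
```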
4.2.3 Preparing Publication Quality Table: olsrr method
Let’s update the model by removing the five variables that are not statistically significant, and display a publication-quality table using ols_regress() of the olsrr package.
rental_mlr <- lm(formula = monthly_rent ~ no_of_kindergarten_500m +
+ no_of_hawker_500m
+ prox_hawker + no_of_busstop_500m + prox_busstop
+ prox_shoppingmall + prox_mrt
+ prox_cbd,
data = rental.sf)
ols_regress(rental_mlr)
                           Model Summary
--------------------------------------------------------------------
R 0.193 RMSE 646.509
R-Squared 0.037 MSE 418120.408
Adj. R-Squared 0.037 Coef. Var 20.847
Pred R-Squared 0.036 AIC 405798.183
MAE 498.559 SBC 405879.730
--------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
--------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
--------------------------------------------------------------------------------
Regression 414827561.762 8 51853445.220 124.016 0.0000
Residual 10747366972.513 25704 418120.408
Total 11162194534.275 25712
--------------------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------------------
(Intercept) 3386.962 27.314 124.001 0.000 3333.425 3440.499
no_of_kindergarten_500m 10.715 2.841 0.024 3.771 0.000 5.146 16.284
no_of_hawker_500m -48.313 6.504 -0.062 -7.428 0.000 -61.061 -35.565
prox_hawker 0.050 0.011 0.038 4.646 0.000 0.029 0.071
no_of_busstop_500m 7.354 1.029 0.050 7.146 0.000 5.337 9.371
prox_busstop 0.181 0.076 0.015 2.391 0.017 0.033 0.329
prox_shoppingmall -0.109 0.011 -0.065 -9.595 0.000 -0.131 -0.087
prox_mrt -0.125 0.013 -0.064 -9.831 0.000 -0.150 -0.100
prox_cbd -0.028 0.001 -0.192 -25.274 0.000 -0.030 -0.026
-----------------------------------------------------------------------------------------------------------
4.2.4 Testing
To ensure that our multiple linear regression model is good to go, we need to perform the following tests:
Test of Multicollinearity
Test of Non-Linearity
Test for Normality Assumption
Test for Spatial Autocorrelation
4.2.4.1 Test of Multicollinearity
In the code chunk below, ols_vif_tol() of the olsrr package is used to test for signs of multicollinearity.
This checks whether the independent variables in the model are highly correlated with each other. They should not be, as high correlation leads to less reliable coefficient estimates.
ols_vif_tol(rental_mlr)
                Variables Tolerance      VIF
1 no_of_kindergarten_500m 0.9497317 1.052929
2 no_of_hawker_500m 0.5330345 1.876051
3 prox_hawker 0.5725711 1.746508
4 no_of_busstop_500m 0.7554615 1.323694
5 prox_busstop 0.9180374 1.089280
6 prox_shoppingmall 0.8053481 1.241699
7 prox_mrt 0.8860694 1.128580
8 prox_cbd 0.6468219 1.546021
Since the VIF values of the independent variables are all less than 10, we can safely conclude that there are no signs of multicollinearity among them.
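For reference, each VIF above is 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the remaining predictors. A base-R sketch on synthetic data (hypothetical variables):

```r
# VIF by hand: regress one predictor on the others, then 1 / (1 - R^2)
set.seed(7)
n  <- 1000
a  <- rnorm(n)
b  <- 0.8 * a + rnorm(n, sd = 0.6)  # deliberately correlated with a
c_ <- rnorm(n)                      # independent predictor ("c_" avoids masking c())

r2_b  <- summary(lm(b ~ a + c_))$r.squared  # about 0.64 by construction
vif_b <- 1 / (1 - r2_b)
vif_b  # noticeably above 1 because b is correlated with a
```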
4.2.4.2 Test for Non-Linearity
We need to check that the relationship between the dependent variable (monthly rent) and the independent variables (e.g., proximity to MRT) is linear. If the relationship is non-linear, the model will not accurately capture the patterns in the data.
In the code chunk below, ols_plot_resid_fit() of the olsrr package is used to perform the linearity assumption test.
ols_plot_resid_fit(rental_mlr)
The figure above reveals that most of the data points are scattered around the 0 line; hence we can safely conclude that the relationships between the dependent variable and the independent variables are linear.
4.2.4.3 Test for Normality Assumption
The residuals (the differences between observed and predicted values) in linear regression are assumed to be normally distributed. If the residuals deviate significantly from normality, it may indicate problems such as outliers or influential points.
The code chunk below uses ols_plot_resid_hist() of olsrr package to perform normality assumption test.
ols_plot_resid_hist(rental_mlr)
As we can see, the residuals of the multiple linear regression model resemble a normal distribution.
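A formal complement to the histogram is the Shapiro-Wilk test (shapiro.test() in base R). Since it only accepts between 3 and 5000 values, a random subsample of our ~25,700 residuals would be tested. A sketch on synthetic stand-in residuals (not the actual model residuals):

```r
# Shapiro-Wilk normality test on a subsample (the test caps n at 5000)
set.seed(99)
res <- rnorm(25000)                    # stand-in for rental_mlr$residuals
sw  <- shapiro.test(sample(res, 5000))
sw$statistic  # W close to 1 for normally distributed residuals
sw$p.value    # large p-value => no evidence against normality
```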
4.2.4.4 Test for Spatial Autocorrelation
In spatial data, nearby observations may not be independent of each other, violating the assumption of independent observations. If residuals are spatially autocorrelated, it suggests that the model is missing key spatial relationships, leading to biased or inefficient estimates.
mlr.output <- as.data.frame(rental_mlr$residuals)
rental.res.sf <- cbind(rental.sf,
rental_mlr$residuals) %>%
  rename(`MLR_RES` = `rental_mlr.residuals`)
Let’s display the distribution of the residuals on an interactive map:
tmap_mode("view")
tm_shape(mpsz_svy21)+
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(rental.res.sf) +
tm_dots(col = "MLR_RES",
alpha = 0.6,
style="quantile") +
tm_view(set.zoom.limits = c(11,14))
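To test this formally, the next step would be a Moran’s I test on MLR_RES (e.g. with moran.mc() of the spdep package, which is already loaded). The statistic itself is simple enough to sketch in base R on a toy one-dimensional “map” (hypothetical data, for intuition only):

```r
# Moran's I by hand:
# I = (n / sum(w)) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
morans_i <- function(x, w) {
  xc <- x - mean(x)
  n  <- length(x)
  (n / sum(w)) * sum(w * outer(xc, xc)) / sum(xc^2)
}

# Four locations on a line, rook-style adjacency weights
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1

clustered <- c(1, 1, 10, 10)  # similar residuals next to each other
alternate <- c(1, 10, 1, 10)  # dissimilar residuals next to each other
morans_i(clustered, w)  # positive => spatially autocorrelated
morans_i(alternate, w)  # negative => dispersed
```

A significantly positive Moran’s I on the residuals would justify moving on to GWR, which models these local spatial relationships directly.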